On the maximal size of Large- Average and ANOVA-fit 
Submatrices in a Gaussian Random Matrix 



Xing Sun and Andrew B. Nobel * 
June 09, 2010 

Abstract 

We investigate the maximal size of distinguished submatrices of a Gaussian random 
matrix. Of interest are submatrices whose entries have average greater than or equal to 
a positive constant, and submatrices whose entries are well- fit by a two-way ANOVA 
model. We identify size thresholds and associated (asymptotic) probability bounds 
for both large-average and ANOVA-fit submatrices. Results are obtained when the 
matrix and submatrices of interest are square, and in rectangular cases when the matrix 
submatrices of interest have fixed aspect ratios. In addition, we obtain a strong, interval 
concentration result for the size of large average submatrices in the square case. A 
simulation study shows good agreement between the observed and predicted sizes of 
large average submatrices in matrices of moderate size. 
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1 Introduction 



Gaussian random matrices (GRMs) have been a fixture in the application and theory of 
multivariate analysis for many years. Recent work in the field of random matrix theory has 
provided a wealth of information about the eigenvalues and eigenvectors of Gaussian, and 
more general, random matrices. Motivated by problems of data mining and the exploratory 
analysis of large data sets, this paper considers a different problem, namely the maximal size 
of distinguished submatrices in a GRM. Of interest are submatrices that are distinguished 
in one of two ways: (i) the average of their entries is greater than or equal to a positive 
constant or (ii) the optimal two-way ANOVA fit of their entries has average squared residual 
less than a positive constant. 

Using arguments from combinatorial probability, we identify size thresholds and associ- 
ated probability bounds for large average and ANOVA-fit submatrices. Results are obtained 
when the matrix and the submatrices of interest are square, and when the matrix and the 
submatrices of interest have fixed aspect ratios. In each case, the maximal size of a distin- 
guished submatrix grows logarithmically with the dimension of the matrix, and depends in 
a polynomial-type fashion on the inverse of the constant that constitutes the threshold of 
distinguishability. In the rectangular case, the aspect ratio of the submatrix plays a more 
critical role than the aspect ratio of the matrix itself. In addition, we obtain upper and 
lower bounds for the size of large average submatrices in the square case. In particular, for 
n x n GRMs, the size of the largest square submatrix with average greater than r > is 
eventually almost surely within in an interval of fixed width that contains the critical value 
4r _2 (lnn - ln(4T~ 2 Inn)). 

We assess our bounds for large average submatrices via a simulation study in which 
the size thresholds for large average submatrices are compared to the observed size of such 
submatrices in a Gaussian random matrix. For matrices with moderate size and aspect 
ratio, there is good agreement between the observed and predicted sizes. 

Results of the sort established here fall outside the purview of random matrix theory 
and its techniques. Nevertheless, random matrix theory does provide some insight into the 



logarithmic scale of large average submatrices. This is discussed briefly in Section 1.3 below 



1.1 Exploratory Data Analysis 

The results of this paper are motivated in part by the increasing application of ex- 
ploratory tools such as biclustering to the analysis of large data sets. To be specific, con- 
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sider an m x n data matrix X that is generated by measuring the values of m real- valued 
variables on each of n subjects or samples. The initial analysis of such data often involves 
an exploratory search for interactions among samples and variables. In genomic studies of 
cancer, sample-variable interactions can provide the basis for new insights and hypotheses 
concerning disease subtypes and genetic pathways, c.f. [HI [T7\ l6l 120 1 |2T| [25] . 

Formally, sample- variable interactions correspond to distinguished submatrices of X. 
The task of identifying such submatrices is generally referred to as biclustering, two-way 
clustering or subspace clustering in the computer science and bioinformatics literature. 
There is presently a substantial body of work on biclustering methods, based on a variety 
of submatrix criteria; overviews can be found in [13\ [TU[ [T5] and the references therein. 
In particular, the biclustering methods by Tanay et al. [24J and by Shabalin et al. |18j 
search for submatrices whose entries have a large average value, while those of Cheng and 
Church [3j and Lazzeroni and Owen |12j search for submatrices whose entries are well fit 
by a two-way ANOVA model. The effectiveness of these procedures in the analysis of real 
data is considered in [18]. 

An exact or heuristic search among the (exponentially large) family of submatrices of a 
data matrix for those that are distinguished by their average or ANOVA fit leads naturally 
to a number of statistical questions related to multiple testing. For example, how large does 
a distinguished submatrix have to be in order for it to be considered statistically significant, 
and therefore potentially worthy of scientific interest? What is the statistical significance of 
a given distinguished submatrix? Quantitative answers require an appropriate null model 
for the observed data matrix, and in many cases, a GRM model is a natural starting point 
for analysis. When a GRM null is appropriate, the results of this paper provide partial 
answers to the questions above. 

We note that answers to statistical questions like those above can have algorithmic 
implications. For example, knowing the minimal size of a significant submatrix can provide 
a useful filtering criterion for exhaustive or heuristic search procedures, or can drive the 
search procedure in a direct way. The biclustering method in [18] is based on a simple, 
Bonferroni corrected measure of statistical significance that arises in the initial analyses 
below. 

1.2 Bipartite Graphs 

Our results on large average submatrices can also be expressed in graph-theoretic terms, 
as every m x n matrix X is associated in a natural way with a bipartite graph G = (V, E). 
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In particular, the vertex set V of G is the disjoint union of two sets V\ and V2, with \ V\\ = m 
and IV2I = n, corresponding to the rows and columns of X, respectively. For each row i € V\ 
and column j e V2 there is an edge £ E with weight Xij. There are no edges between 
vertices in V± or between vertices in V%. With this association, large average submatrices 
of X are in 1:1 correspondence with subgraphs of G having large average edge- weight. The 
complexity of finding the largest subgraph of G whose average edge weight is greater than 
a threshold appears to be unknown. However, it is shown in [5] that a slight variation of 
this problem, namely finding the maximum edge weight subgraph in a general bipartite 
matrix, is NP-complete. A randomized, polynomial time algorithm that finds a subgraph 
whose edge weight is within a constant factor of the optimum is described in pQ, but this 
algorithm cannot readily be adapted to the problem considered here. 

1.3 Connections with Random Matrix Theory 

The theory of random matrices provides some insight into the relationship between large 
average submatrices and the singular value decomposition. In practice, the GRM assump- 
tion made here acts as a null hypothesis. If an observed matrix contains a large average 
submatrix whose size exceeds the thresholds given below, one may reject the GRM hypoth- 
esis, and subject the identified submatrix to further analysis. This suggest an alternative 
hypothesis, under which a fixed constant is added to every element of a select submatrix 
of the null matrix, effectively embedding a large average submatrix within a background of 
Gaussian noise. It is then natural to ask if the embedded submatrix affects the top singular 
value or singular vectors of the resulting matrix. We argue below that the answer is a 
qualified no. 

Let W be an m x n Gaussian random matrix, representing the null distribution. Define 
a rank-one matrix S = 2Tab t where r > is a fixed constant, and a S {0, l} m , b £ {0, l} n 
are indicator vectors having k and I non-zero components, respectively. The outer produce 
a b l defines a submatrix C whose rows and columns are indexed by the indicator vectors 
a and b, respectively. The matrix Y = W + S is distributed according to an alternative 
hypothesis under which the fixed constant r has been added to every entry of the submatrix 
C. 

Suppose that the dimensions m, n, k and / grow (with n, say) in such a way that the 
matrix aspect ratio m/n —> a with a £ [l,oo), and the submatrix aspect ratio k/l remains 
bounded away from zero and infinity. It is easy to see that the average of the k x I submatrix 
C in Y has distribution M(2t, (A;/) -1 ), which is greater than r with overwhelming probability 
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when k and I are large. It follows from Proposition [T] that the probability of finding a k x I 
submatrix with average greater than r in the matrix W is vanishingly small if k and I grow 
faster than logn. Thus, we might expect to see evidence of C in the first singular value, or 
the associated singular vectors, of Y. 

Given an m x n matrix U, let si(U) > ■ • • > s m {U) denote its ordered singular values, 
and let ||?7||f = ^ij u ij denote its Frobenius norm. The difference between the largest 
singular value of W and Y can be bounded as follows: 

n 

( Sl (Y) - Sl (w)f < YX°i<r) - s iW) 2 

j=l 

n 

< Y,(sj(Y-W)) 2 

3=1 

= \\Y-W\\ 2 F = \\Z\\ 2 F = T 2 kl. (1) 

The second line above follows an inequality of of Lidskii (c./. Exercise 3.5.18 of [9]), and the 
third makes use of the fact that the Frobenius norm of a matrix is the sum of the squares 
of its singular values. By a basic result of Geman [7j, 

Sl{W) (2) 



n 1 / 2 

with probability one as n tends to infinity. If k = o(m l l 2 ) and I = o(n 1 / 2 ), inequality (jl 
implies that n -1 / 2 |si(y) — si(W)| —¥ with probability one, and therefore ^ holds with Y 
in place of W. In other words, the asymptotic behavior of n~ l l 2 s\(W) is unchanged under 
the alternative Y = W + Z if the dimensions of the embedded submatrix C grow more 
slowly than n 1 / 2 . (Recall that m is asymptotically proportional to n.) 

For fixed r and k, I such that logn << k, I « ra 1 / 2 , the embedded submatrix C in Y is 
highly significant, but has no effect on the scaled limit of si(Y). Under the same conditions, 
C is also not recoverable from the top singular vectors of Y. To be precise, let u% and v\ 
be the left and right singular vectors of Y corresponding to the maximum singular value 
s\{Y). Using results of Paul |16| on the singular vectors of spiked population models, it 
can be shown that a u\ and b l vi tend to zero in probability as n tends to infinity. Thus the 
row and column index vectors of C are asymptotically orthogonal to the first left and right 
singular vectors of Y. 

1.4 Overview 

The next section contains probability bounds and a finite interval concentration result 
for the size of large average submatrices in the square case. Size thresholds and probability 
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bounds for ANOVA submatrices in the square case are presented in Section |3j Thresholds 
and bounds in the rectangular case are given in Section |4j Section [5] contains a simulation 
study for large average submatrices. Sections [6] -[8] contain the proofs of the main results. 

2 Thresholds and Bounds for Large Average Submatrices 

Let W = {wij : i,j > 1} be an infinite array of independent M(0, 1) random variables, 
and for n > 1, let W n = {wij : 1 < i,j < n} be the n x n Gaussian random matrix 
equal to upper left hand corner of W. (The almost-sure asymptotics of Theorem [l] requires 
consideration of matrices W n that are derived from a fixed, infinite array.) A submatrix 
of W n is a collection U = {wij '■ i £ A,j S B} where A, B C {1, . . . , n}. The Cartesian 
product C = A x B will be called the index set of U, and we will write U = W n [C]. The 
dimension of C is \A\ x where \A\, \B\ denote the cardinality of A and B, respectively. 
Note that the rows A need not be contiguous, and that the same is true of the columns B. 
When no ambiguity will arise, the index set C will also be referred to as a submatrix of W n . 

Definition: For any submatrix U of W n with index set C = A x B, let 

nu) = ^ £ vhj = 1^ E 

1 1 (i,i)ec* i ii i ieAJeB 

be the average of the entries of [/. Note that F(U) ~ A/"(0, |C| _1 ). 

We are interested in the maximal size of square submatrices whose averages exceed a 
fixed threshold. This motivates the following definition. 

Definition: Fix r > and n > 1. Let K T (W n ) be the largest A; > such that W n contains 
a k x k submatrix [/ with F{U) > r. 

As the rows and columns of a submatrix need not be contiguous, the statistic K T (W n ) 
is invariant under row and column permutations of W n . We may regard the Gaussian 
distribution of W n as a null hypothesis for testing an observed nxn data matrix, and K T {-) 
as a test statistic with which we can detect departures from the null. Our immediate goal is 
to obtain bounds on the probability that K T (W n ) exceeds a given threshold, and to identify 
a threshold for K T (W n ) that governs its asymptotic behavior. To this end, we begin the 
analysis of K T (W n ) using standard first moment type arguments, which are detailed below. 

Let r/ c (n,r) be the number of k x k submatrices in W n having average greater than or 
equal to r. We begin by identifying the value of k for which ETk(n,T) is approximately 
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equal to one. If S k denotes the set of all k x k submatrices of W n then 



r fe (n,r) = I{F(Wn[U])>r}, (3) 
uas k 



and consequently 



ET k (n,r) = \S k \ ■ P(F(W n [U]) > r) = (f\ \l - $(rfc)) < (4) 

where in the last step we have used a standard bound on 1 — $(•). For s S (0,n) define 

^ n , T (a) = (2vr)-in n+ i S - s -5 (n-sJ-M-^-T. (5) 

Using the Stirling approximation of (^), it is easy to see that <p n ,T(k) is an approximation 
of the square root of the final expression in Q. In particular, the rightmost expression in 
Q is less than 2(p n ^ T (k) 2 . With this in mind, let s(n, r) be any positive, real root of the 
equation 

0n,r(s) = 1. (6) 

The next result shows that s(n, r) exists and is unique, and it provides an explicit expression 
for its value when r is fixed and n is large. 

Lemma 1. Let r > be fixed. When n is sufficiently large, equation |6p has a unique root 
s(n,r), and 

4 4 / 4 \ 4 

s(n,r) = ^lnn--2ln(-^lnnj+-2+ o(l) (7) 

where o(l) — > as n — > oo. 

We show below that the asymptotic behavior of the random variables K T (W n ) is gov- 
erned by the root s(n,r) of equation Q. To begin, note that for values of k greater than 
s(n, r), the expected number of k x k submatrices U of W n with F(U) > r is less than one. 
The next proposition shows that the probability of seeing such large submatrices is small. 

Proposition 1. Let r > be fixed. For every e > 0, when n is sufficiently large, 

P(K T (W n )> S (n,r)+r) < ± n ~ 2r ^ (8) 

for every r = 1, . . . , n. 

The proofs of Lemma [T] and Proposition [T] are given in Section [6} The arguments 
are similar to those in [23], with adaptations to the present setting. A result similar to 
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Proposition[T]can also be obtained from the comparison principle for Gaussian sequences (cf. 
|19j). To be specific, fix k > 1 and note that the family of random variables {F(U) : U S Sk} 
is a Gaussian random field with m = (^) 2 elements that are pairwise positively correlated, 
and have a common jV(0, kr) distribution. Then, by the comparison principle, 

P (K T (W n ) > k) = P( max F(U) > t) < P (max{Zi, Z m } > r) , 

where Z%, . . . , Z m are independent Af(0, kr) random variables. Using Poisson approximation 
based bounds such as those in Section 4.4 of [2], one may obtain a probability upper bound 
similar to that in Q. 

It follows from Proposition [T] and the Borel Cantelli Lemma that, with probability one, 
K T {W n ) is eventually less than or equal to \s(n, r)] + 1 < s(n, r) + 2. Our principal result, 
stated in Theorem [T] below, makes use of a second moment argument in order to obtain a 
corresponding lower bound. The proof is given in Section [8j 

Theorem 1. Let W n , n > 1, be Gaussian random matrices derived from an infinite array 
W, and let r > be fixed. With probability one, when n is sufficiently large, 

s{n,r) - A _ _ 4 < K T (W n ) < s(n,r) + 2. (9) 

The difference between the upper and lower bounds in Theorem [T] is a constant that 
depends on r, but is independent of the matrix dimension n. In particular the values of the 
random variable K T (W n ) are eventually concentrated on an interval that contains s(n,r) 
and whose width is independent of n. 

The lower bound in Theorem [T] can be further improved. An examination of the argu- 
ment in Lemma [4] in the Appendix shows the inequality of the theorem still holds if the 
quantity 12 In 2 is replaced with any constant greater than 8 In 2. 

Extending earlier work of Dawande et al. [5] and Koyuturk et al. |llj . Sun and Nobel 
|22l [23] obtained a similar, two-point concentration result for the size of largest square 
submatrix of ones in an i.i.d. Bernoulli random matrix. Bollobas and Erdos [3\, and Matula 
|14j . established analogous results for the clique number of a regular random graph. (See 
|23j for additional references to work in the binary case.) The proof of Theorem [T] relies on 
a second moment argument, but differs from the proofs of these earlier results due to the 
continuous setting. In particular, the proof makes use of the fact that, under the Gaussian 
assumption made here, for any k x k submatrix U of W, there exist simple upper bound 
and lower bounds on P(F(U) > r), and that the ratio of these bounds is of order rk. 
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3 Thresholds and Bounds for ANOVA Submatrices 



In this section we derive bounds like those in Proposition [T] for the size of submatrices 
whose entries are well-fit by a two-way ANOVA model. Roughly speaking, the ANOVA 
criterion identifies submatrices whose rows (and columns) are shifts of one another. 

Definition: For a submatrix U of W n with index set A x B, define 

G(U) = min I — _ 1 _ K' " a i ~ h i ~ c ) 2 

where the minimum is taken over all real constants {en : i £ A}, {bj : j £ B} and c. 

Under the ANOVA criterion, a submatrix U will warrant interest if g(U) is less than a 
pre-defined threshold. Note that by standard arguments, 

GiJJ) = (\A\-1)(\B\-1) ^ (wij-m.-w.j+w.f, 

Vl 1 Al 1 ; i£A,j£B 
where Wi_, w.j, and w,_ denote the row, column, and the full submatrix averages, respectively. 

Definition: Given < r < 1, let L r (W n ) be the largest value of k such that W n contains 
a k x k submatrix U with G(U) < r. 

Arguments similar to those in the proof of Proposition [TJ in conjunction with a prob- 
ability upper bound on the left tail of a x 2 distribution, establish the following bound on 
L T (W n ). The proof is given in Section [7j 

Proposition 2. Let t > be fixed. For every e > 0, when n is sufficiently large, 

P(L r (W„)> t (n,r) + r) < ^y(^) 2r+2+< »- 2 ' HO) 
for every r = 1, . . . , n, where 

4 , 4 , / 4 , \ 4 

h(T) h{T) \h{T) J h{T) 

and 

h(r) = 1 -r-log(2-r). (11) 

Proposition [2] and the Borel Cantelli Lemma imply that L T (W n ) < t(n, r) + 1 eventually 
almost surely. The arguments used to lower bound K T (W n ) in Theorem [T] do not extend 
readily to L T (W n ), and we are not aware if a similar interval-concentration result holds in 
this case. 
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4 Thresholds and Bounds for Rectangular Submatrices 

The probability bounds of Proposition [T] and [2] can be extended to non-square sub- 
matrices of non-square matrices by adapting the methods of proof detailed in Sections [6] 
and [7j respectively. We present the resulting bounds below, without proof. Similar results 
concerning submatrices of Is in binary matrices can be found in [23J. 

Definition: Let W(m, n) denote an m x n Gaussian random matrix, and let a > and 
jS > 1 be fixed aspect ratios for the sample matrix and target submatrix respectively. 

a. For r > let K T (W : n, a, /3) be the largest integer k such that there exists a \f3k~\ x k 
submatrix U in W([cm~|,n) with F(U) > r. 

b. For < r < 1 let L T (W : n,a,/3) be the largest integer k such that there exists a 
\(3k] x k submatrix U in VF(|~cm],ra) with G(U) < r. 



Proposition 3. Fix r > and any e > 0. When n is sufficiently large, 

(J3+l+e)r 

for each 1 < r < n, where 



P(K T (W :n,a,P) > s(n,r,a, 0) + r) < n~^ +1 ^ r ( 



lnn 



s(n,T,a,P) = — — — -Inn — — -In 

for some constant Cx(/3,t) > 0. 



2(1 + ■ 
inn 



T 2 



2 

+ -J rna + Ci(/3,T), 



Proposition 4. Fix < r < 1 and any e > 0. When n is sufficiently large, 

lnn X(/3+l+e), 



P(L T (W : n,a 



(3) > t(n,r,a,p) + r) < rT^^ r ( 



h(r) 



for each 1 < r < n, where 
2(l + /3" 1 ) 



t(n, t, a, /3) 



h(r) 



i ^(1 + r 1 ) , 

m n —— in 

h(r) 



2(l + r 1 ) 
h(r) 



Inn 



+ h(ry l In a + C 2 (/3,r) 



/or some constant C , 2( / S,t) > 0, where h{r) is defined as in (11). 



Remark: The bounds in Propositions [3] and [4] have a similar form. In each case, the bound 
is of the form n~^ +1 ^ r times a polynomial in Inn, and the leading term in s(-) and £(•) 
are of the form (1 + /3 _1 ) Inn times a function of the threshold r. We note the critical role 
played by the aspect ratio /3 of the target submatrix. By contrast, the aspect ratio a of the 
sample matrix plays a secondary role, its logarithm appearing only in the constant term of 
s(-) and t(-). 
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5 Simulation Study for Large Average Submatrices 

The size thresholds and probability bounds presented in Sections [2] - [4] are asymptotic, 
and it is reasonable to ask if they apply to matrices of moderate size. To this end, we 
carried out a simulation study in which we compared the size of large average submatrices 
in simulated Gaussian data matrices with the bounds predicted by the theory. An exhaus- 
tive search for large average submatrices is not computationally feasible. Our study was 
based on a simple search algorithm for large average submatrices that is used in the biclus- 
tering procedure of Shabalin et al. [IB] . Analogous application of existing ANOVA based 
biclustering procedures does not appear to be straightforward, so the simulation study was 
restricted to the large average criteria. 

The search algorithm from |18| operates as follows. Given an m x re data matrix W and 
integers 1 < k < m and 1 < I < n, a random subset of I columns of W is selected. The 
sum of each row over the selected set of I columns is computed, and the rows corresponding 
to the k largest sums are selected. Then the sum of each column over the selected set 
of k rows is computed, and the columns corresponding to the / largest sums are selected. 
This alternating update of row and column sets is repeated until a fixed point is reached, 
and the average of the resulting k x / matrix is recorded. The basic search procedure is 
repeated N times, and the output of the search algorithm is the largest of the N observed 
submatrix averages. The search algorithm is not guaranteed to find the k x I submatrix of W 
with maximum average. However, the algorithm provides a lower bound on the maximum 
average value of k x I submatrices We conducted two experiments, one for square matrices 
and one for rectangular matrices. 

Square matrices. We considered matrices of size re = 200 and n = 500. Results from 
the case re = 200 are summarized in Figure [TJ For a fixed k > 1, we generated a 200 x 200 
Gaussian random matrix W, and then used the search algorithm described above to find a 
lower bound, r^, on the maximum average of the k x k submatrices of W using N = 10000 
iterations of the search procedure. Different random matrices W were generated for different 
values of k. The upper and lower bounds of Theorem [I] begin to diverge when r < 1/2, so 
we restricted attention to values of k for which > 1/2. In this case k ranged from 1 to 55. 
A linear interpolation of the pairs (t^, A;) appears as the red curve in Figure [TJ We have also 
plotted the threshold function s(n,r) derived in Lemma [TJ omitting the o(l) term, as well 
as the upper and lower bounds from Theorem [TJ As can be seen from the figure, there is 
good agreement between the observed and predicted sizes of large average submatrices. In 
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Figure 1: Results of 200 x 200 simulations 




Figure 2: Results of 500 x 500 simulations 



500 x 500 

45 | , , , , 




particular, for the range r > 1/2 the observed sizes of large average submatrices fall within 
the upper and lower bounds of the theorem. 

Simulations for matrix size n = 500 were carried out in a similar fashion. The results, 
based on N = 10000 iterations of the search procedure for each value of k, are summarized 
in Figure [2j Restricting attention to Tp. > 1/2 leads to matrix sizes k between 1 and 55 As 
in the case n = 200 there is good agreement between the observed and predicted sizes of 
large average submatrices, and the observed sizes of large average submatrices fall within 
the upper and lower bounds of Theorem [Tj 

Non-Square matrices. We also carried out two simulation studies for rectangular 
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Figure 3: Results for rectangular simulations 
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matrices of sizes 20,000 x 200 and 100,000 x 1000 (matrix aspect ratio a = 100). These 
sizes reflect those commonly seen in high-throughput genomic data. In each case, we looked 
for submatrices with aspect ratio /3 = 5 and (5 = 10. For each fixed k 6 {5, 10, 15, 20, 25}, 
we generated a Gaussian random matrix of the appropriate size and then used the search 
algorithm with N = 10000 iterations to identify (3k xk submatrices with large average. The 
results are summarized in the (interpolated) red curves of Figure [5| The theoretical upper 
bounds from Proposition [3] are plotted in blue for comparison. In each case the observed 
maxima lie below the theoretical upper bound; the gap decreases with decreasing f3 and 
increasing r. 

6 Proof of Lemma [l] and Proposition [I] 

Proof of Lemma [if Let r > be fixed, and note that 

1 1 1 T 2 S 2 1 

ln</> n>r (s) = (n + -) Inn - (s + -) Ins - (n - s + -) ln(n - s) -ln27r. (12) 

Differentiating ln</!> n;T (s) with respect to s yields 

8s 2{n-s) + [ j 2s 2 " 

The last expresssion is negative when 2r _2 lnn < s < 4r _2 lnn; we now consider the 
value of ln0 n>r (s) for s outside this interval. A straightforward calculation shows that for 
< s < 2t~ 2 Inn, 

/ sr 2 \ 1 1 

ln0 n>r (s) > s ( ln(n — 2r~ 2 Inn) In Inn — ln2r -2 J — — Ins — — ln27r, 
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which is positive when n is sufficiently large. In order to address the other extreme, note 



that from (12) we have 



hi^nrO) < s (ln(n- s) - -Ins] - 1 lns + (n + 1/2) In f— ^— \ M 3 ) 
\ 4 / 2 \n — s J 

It is easy to check that the right hand side of the above inequality is negative when s > n—2. 
Considering separately the cases s + 2 < n < (2 In 2) -1 s In s and n > (2 In 2) _1 s In s, one 
may upper bound the final term above by (s In s)/2 + (In 2)/2 and 2s + (In 2)/2, respectively. 
Thus, for s < n — 2, we have 

ln0 niT (s) < slln(n-s) In s J - - In s + 2s H h — , 

and in particular, for 4r -2 Inn < s < n — 2, 

, , . , / lns\ 1 ln2 

ln</>„, T (s) < s (2-— J - -lns + — < 

when n (and therefore s) is sufficiently large. Thus for large n there exists a unique solution 
s(n, r) of the equation (j) n . T (s) = 1 with s(n,r) G (2r~ 2 Inn, 4r~ 2 Inn). 

Taking logarithms of both sides of the equation (j) n ^ T (s) = 1 and rearranging terms yields 
the expression 

l.n n , 1. . , r 2 s 2 ln27r 

-In hnm (s H — ) ins + sinfn — s) — = . (14) 

2 n — s n — s 2 4 2 

The argument above shows that the (unique) solution of this equation belongs to the interval 
(2t~ 2 Inn, 4t~ 2 Inn), so we consider the case in which s and n/s tend to infinity with n. 



Dividing both sides of ( 14 ) by s yields 



st 2 In s 

ln(n-s) —-Ins = -1 + 0( — ), 

which, after adding and subtracting terms, can be rewritten in the equivalent form 

Inn - ^ - In Inn = In (-?-) - In ( — -) - 1 + 0( — ). (15) 



4 Vlnn/ \ n J s 

For each n > 1, define R(n) via the equation 

s(n,r) = 4r -2 Inn — 4r~ 2 In Inn + R(n). 



Plugging the last expression into (15), we find that R(n) = -^(1 — ln^-) + o(l), and the 



result follows from the uniqueness of s(n,r). 

Proof of Proposition [l| Fix r > 0. If \s(n, r)] + r > n the bound ([T]) holds trivially; in 
the case of equality, it follows from a standard Gaussian tail bound when n is sufficiently 



14 



large. Fix n > 1 for the moment and suppose that I = \s(n, r)] + r < n — 1. By Markov's 
inequality and the definition of <f> n)T (-), 

P(M T (W n )>s(n,T) + r) = P(M T (W n ) > I) 

= P(Ut(n,r)>l) 

< EUi{u,t) 

< 2< T (0 < 2^ )T (a(n,r)+r). (16) 

Let 7 = e -r2 / 4 and, to reduce notation, denote s(n, r) by s n . Under the constraint on r, a 
straightforward calculation shows that one can decompose the final term above as follows: 

2< T (* n + r) = 2 4> 2 {s n ) 1 2 ^[A n {r)B n {r)C n {r)D n {r)] 2 (17) 



where 



Ari(r) = (" r - S " " Bn (r) < r + "" 



Tl Sn / \ S>, 



C n (r) = [\^l s ") D n (r) 
r + s n 



It is enough to bound the right hand side of (17) as n increases and r = r(n) is such 



that \s(n, t)] + r < n — 1. By definition, (p njT (s n ) = 1, and for each fixed e > 0, 

max 7T-. > as n — > oo. 

r>l n -2r(2inn^)2r+e 

Thus it suffices to show that the product A n (r) B n (r) C n {r) D n (r) is uniformly bounded in 
r. To begin, note that for any fixed < 5 < 4, 

r + s n s n 4 - d 

The last term will be less than one when S is sufficiently small. The term B n (r) < 1 
for each r > 1, so it only remains to show that max r >ii n (r) • D n (r) is bounded as a 
function of n. A straightforward calculation shows that In A n (r) < r, and consequently, 

2 2 

ln^4„(r) • D n (r) < r — a quadratic function of r that is bounded from above. 



7 Proof of Proposition [2] 

For any k x k submatrix U of the Gaussian random matrix W n , it follows from standard 
arguments that (k— 1) 2 G(U) has a x 2 distribution with (k — l) 2 degrees of freedom. In order 
to bound the quantity P(G{U) < r), which arises in the analysis of L T (W n ), we require an 
initial result relating the right and left tails of the x 2 distribution. 
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Lemma 2. Suppose that X ~ xj f or some £ > 3. Then for < t < £ — 2 we have 

P{X<t) < P(X > 2£-4-t). 
Proof of Lemma [2} Let / denote the density function of X and let < t < £ — 2. Since 
P(X <t) = f(s) ds and P(X > 2£ - 4 - t) > / f(s) ds, 

Jo J2l-4-t 



it suffices to show that 



f{2£ - 4 - s) 



< 1 for all < s < £ - 2. 



(18) 



To this end, note that the ratio in (18) can be rewritten as follows: 

s (£-2)/2 e -s/2 



f{2£ - 4 - s) {21 - 4 - s)(^ 2 )/ 2 e -(^- 4 -*)/ 2 



2£ - 4 - 2s 



2£-4-s 



o 2{l-2-s)l{l-2) 



1 (/-2)/2 



1 \ 

1 ) e 2 "- 1 

u 



(<-2)/2 



with u 



As s tends to £ — 2, -u tends to infinity, and therefore 



,., w . lim I 1 ] e 2 *-i 

s^(e-2) f(2£ - 4 - S) u-^oo V u . 



lim 



2£ - 4 - s 
2£-A-2s 



1. 



(19) 



Thus, it suffices to show that for u E (1, oo), the final term in (19) is an increasing function 
of u. Differentiating with respect to u we find that 



d ( 1 \ _a_ 
— 1 I e 2 "- 1 



(2u- l) 2 -4(«- l)u 
n 2 (2n- l) 2 



e2 «-i > o 



where the inequality follows from the fact that u > 1. Inequality (18) follows immediately. 



Proof of Proposition [2J To begin, note that if X has a x 2 distribution with £ degrees of 
freedom, then by a standard Chernoff bound, 



P(X >r)< min (1 - 2s)" 2 e" 



0<s<| 



-£/2 



(20) 



Let r > be fixed. Fix n > 1 for the moment and let r > 1 be such that k = 
\t(n, t)~\ + r < n, where t(n, r) is defined as in the statement of Proposition [2] Let U be 
any k x k submatrix of W n , and let £ = [k — l) 2 . As noted above, the random variable 
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£G(U) has a x 2 distribution with £ degrees of freedom, so by Lemma [2] and inequality (20), 

P(G{U)<t) = P(£G(U) < It) < P(£G(U) > (2 - t)£ - 4) 

~(2-r)£-4 



< 



exp 



exp 



1 + ln 



(2-t)£-4 



[(1 - r) - ln(2 - r)] exp 



2 + In 1 



1(2 - r) 



One may readily show that the second term above is 0(1). It then follows from a first 
moment argument that 

2 / \ 2 / \ 2 



P(L r (^ n )>fc) < Q P(G(C/)<r) < 1 (k - 1)2 < C ^ 1 



n 2 (21) 



where C is a finite constant and 



'/ = exp<| -[-(l- r )+ln(2-r) 



Fix e > 0. By following the proofs of Lemma [T] and Proposition [TJ replacing r 2 with 
/i(t) = 1 — t — ln(2 — r), one can show that for every r > 1 such that 



Inn 



In 



fc(r) /i(r) \h(r) 



Inn ) + 



h(r) 



+ 2 + r 



is at most n, we have 



n 

k-1 



fe(r) \h(r) 



Inn 



2r+2+e 



n 



-2r-2 



and the result then follows from (21). 



8 Proof of Theorem [U 

In what follows we make use of standard bounds on the tails of the Gaussian distribution, 
namely that (3s) -1 e~ s2//2 < 1 — 3>(s) < s~ 1 e~ s2 / 2 for s > 3. The proof of Theorem [l] is 
based on several preliminary results. The first result bounds the ratio of the variance of 
TfcfY, n) and the square of its expected value, a quantity that later arises from an application 
of Chebyshev's inequality. 

Lemma 3. Fix r > 0. There exist integers no, ko > 1 and a positive constant C depending 
on t but independent of k and n, such that for any n > hq and any k > ko, 

. k k (k\ (n—k\ (k\fn—k\ 

Var r k (T,n) mA^^ UU-lj UU-rJ 

( E r k (T,nW - C " hh (I) (I) 6XP 



(rlT 2 f k 2 -rl\) 
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Proof: Let S k denote the collection of all k x k submatrices of W n . It is clear that 

2 



ET k (n,r) = P(F(U) >r)= (f) (1 - $ 



(fcr)) . (23) 



u<=s k 

In a similar fashion, we have 

ET 2 k {n,T) = P(F(Ui) > t and F(Uj) > t) 

Note that the joint probability in the last display depends only on the overlap between the 
submatrices f7, and Uj. For 1 < r, I < k define 

G(r,l) = P(F(U) > t and F(V) > r) 

where U and V are two fixed kxk submatrices of W having r rows and I columns in common. 
Then ET k (n,T) = (^) 2 G(0, 0) 1//2 , and a straightforward counting argument shows that 

"fe"-SS(W(::^(r-!)^ 

In particular, 

Varr fc (ra,r) A (t) Cj (fc-r) f gfa Q 

(^(n,r)) 2 ^ (I) (J) \G(0,0) 

hh (*) o vg(o,o) 

where we have used the fact that (*)( 

is a probability mass function, and that 
G(0,/) = G(r,0) = G(0,0). When fcr > 3 we have G(0,0) = (1 - </>(A:t)) 2 > (3fcr)- 2 e - fc2r2 , 
and it therefore suffices to show that for 1 < r, I < k, 

G(r,J) < ^^exp{-^V + ^(l + ^)} (24) 



where G > depends on r but is independent of k and n. Inequality (24) is readily 
established when r = I = k, so we turn our attention to bounding G(r, /) when 1 < r/ < A; 2 . 
In this case 

/w r 00 r «a / k 2 r-rie 2 



where [/, V are submatrices of W„ having r rows and I columns in common. Let <&(x) = 
1 - $>(x). Note that G(r,l) = D + D x where 

D » " wJ-J^* [7P^i) I{kT - r ' t<1}dt <25) 
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and 



Di 



frl f°° _rul-2 k 2 T -rlt 
-= f e 2 <J> 

^tt 7-oo V VA; 2 - rZ 



/{/c 2 r - rZt > 1} dt. 



(26) 



Consider first the term L>i denned in (26). As rl ^ k 2 and k 2 r — rlt > 1, the normal 
tail bound yields 



V /c 2 — rZ 



\/k 2 — rl 



V2n(k 2 T - rlt) 
0{Vk 2 -rl) exp 



cxp 



(fc 2 r - rZt) 2 
2{k 2 -rl) 

(k 2 r - rlt) 2 
2(k 2 -rl) 



Plugging the last expression into (26), the exponential part of the resulting integrand is 



(k 2 r - rlt) 2 rlt 2 
(k 2 - rl) 2~' 

which (after lengthy but straightforward algebra) can be expressed as 



-k 2 r + 



It then follows that 



rZr z 



1 + 



k 2 — rl\ rl(k 2 + rl) 



k 2 +rl) 2(k 2 -rl) 



T-t)+T 



k — rl 
k 2 + rl 



D 1 < 0{k A - rl) exp \ -k 2 r 2 + ( 1 + 



rZr 2 



k — rl 
k 2 + rl 

rl(k 2 +rl) 



lk 2 -rl f°° lrl(k 2 +rl) I rl(k 2 + rl) ( r(k 2 - rlY . , , 

<x, w+ri x L V^7T exp r2(A^wy( r - t+ ^ 2 T7r 1 ),u 



lrl(k 2 + rl) 
k 2 — rl 

The term preceding the integral is less than one, and the integral is equal to one. Thus D\ 



is less than the right side of (24) 



We next consider the term Dq defined in (25). Note that k 2 r — rlt < 1 is equivalent to 
t > [k 2 T — l)/rZ, and therefore 



rl 



(k 2 T-l)/rl V27T 



e 2 dt = $ 



A; 2 r-1 



rl 



< 



k\/ rl 



2ir(k 2 T - 1) 



(fc 2 r-l) 2 
-e 2ri 111 



Comparing the last term above with (24), it suffices to show that when k is sufficiently 
large, 

^ + n fc > fc 2 r 2 

2rZ - V 2 / 

or equivalently 

(k 2 -rl) 2 r 2 - 2k 2 r + 1 + 2rZln/c > 0. (27) 

Suppose first that rl > k 2 — k/y/ln k. In this case, the left side of the expression above is 
at least 



-2Zc 2 r + 1 + 2rZln/c > -2Zc 2 r + 1 + 2(k 2 - k/V\^k) Ink > 
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when k is sufficiently large. Suppose now that k 2 — rl > k/V\nk. As a quadratic function of 
r, the left side of (27) takes its minimum at r = k 2 / (k 2 — rl) 2 , and the corresponding value 



is rl [—2k 2 + rl + 2(k 2 — rl) 2 In k]/(k 2 — rl) 2 . In this case, the assumption k 2 — rl > k/Vln k 



implies 



-2k 2 + rl + 2{k 2 - rl) 2 In k > rl > 0. 



This establishes (27) and complete the proof. 



Lemma 4. Let t > be fixed. When k is sufficiently large, for every integer n satisfying 
the condition 

4 , 4 / 4 , \ 12 In 2 

k < Inn - In — ^ Inn — ?. — (28) 

we have the bound 



Var T k (r,n) 



< k- 2 . 



(ET k (T,n)) 2 

Remark: For the proof of Theorem [TJ it is enough to show that the sum over k of the ratio 
above is finite, and for this purpose the upper bound k~ 2 is sufficient. 



Proof: Let n satisfy the condition ( |28[ ). By Lemma |3j it suffices to show that 

k k /k\ (n—k\ (k\ (n—k\ 
l) \k-U \r) \k-r) 



exp 



rlr 



1 + 



k — rl 
k 2 + rl 



< k- z . 



(29) 



1=1 r =l Ct) (ft) 

In order to establish ( |29[ ), we will show that each term in the sum is less than A;~ 8 . To 
begin, note that 



(»(£» < G)k l (n-k) 



k-l 



k l (n-ky 



(I) ~ {n-kf \l t 
and that (n — k)~ l = 0(n~ l ) when I < k = 0(n 1 ^ 2 ). Thus for some constant C > 0, 



CD 



< a \ " 1 1 U r+ ' n - (r+,) . 



n ( r+ ^ exp 



1 

rlr 2 



Rewriting (28) as Inn > ^ — h ln(^ Inn) + 3 In 2 yields the bound 

k 2 -rV 



1 + 



k 2 + rl 



Combining the last three displays, and using the fact that k < \ Inn by assumption, it 
suffices to show that 
'k\ fk\ 



-3(r*+01n2 JlV w _ 

r y V I J I 2 V fc 2 + W 2 



(r + Z) 



< A; 



-8 



(30) 
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In order to establish (30), we consider two cases for r + 1. Suppose first that r + I < 



By elementary arguments 



;)C) £ C 2 + ; W + ' » d ^ 

It follows from these inequalities that 



2k 2 , (r + /) 2 2k 2 , (r + /) 2 



A; 2 + W 



A; 2 + rZ 











< 


exp 




exp 


< 


exp 



exp 

r 2 
2 



rZ- 



2k 2 



k 2 + rZ 
(r + Z) 2 k 



[(r + l) 



(r + l) 



+ (r + l) ln2k 



r 2 (r + l) 



T 2 (r + l) 



(r + l) 



k 2 In 2k 

+ s— 



3k k 2\u2k 

-7T - ~ + 5— 



As the exponent above is negative when k is sufficiently large, (30) follows. Suppose now 
that r + l > From the simple bounds r + 1 > 2\/rl and k 2 + rl > 2\J k 2 rl we find that 

2k 2 



rl 



k 2rlk 2 



kv rl 



0. 



and it suffices to bound the initial terms in (30). But clearly, 



C)G) e ~ 3(r+oin2 - 22fc - 2-f ' 



which is less than k 8 when k is sufficiently large. 



Proof of Theorem [TJ Proposition [T] and the Borel-Cantelli lemma imply that eventually 
almost surely K T (W n ) < \s(n, r)] +1. Thus, we only need to establish an almost sure lower 
bound on K T (W n ). To this end, define functions 



4 4/4 
f(n) = — g In n ^ In I — ^ In n 



^ — and g(k) = min{n > 1, I f(n) \ = k} 

T 



for integers n > 1 and k > 1, respectively. It is easy to see that f(n) is strictly increasing for 
large values of n, and clearly f(n) tends to infinity as n tends to infinity. A straightforward 
argument shows that g(k) has the same properties Thus for every sufficiently large integer 
n, there exists a unique integer k = k(n) such that g(k) < n < g(k + 1). 

Fix m > 1 and consider the event A m that for some n > m the random variable K T (W n ) 
is less than the lower bound specified in the statement of the theorem. More precisely, define 

12 In 2 I 

\K T [W n ) < s(n,r)- 

n>m 



A m = (J {k t (W u ) < s(n,r) 
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To establish the lower bound, it suffices to show that P(A m ) — > as m — > oo. To begin, 
note that when m is large 

A m C (J (J W„) < s^r)- 12 ^- A_ 4 |. 

fc>L/MJ 9(k)<n<g(k+l) ^ > * ) 

Fix n > m sufficiently large, and let k = k(n) be the unique integer such that g(k) < n < 
g(k + 1). The definition of g(k) and the monotonicity of /(•) ensures that k = \_f(g(k))\ < 
/(n) < k + 1. In conjunction with the definition of f(n) and Lemma [TJ this inequality 
implies that 

1 = k + l-k > f(n)-[f(g(k))\ > f( n )-f(g(k)) 
= a(n,T)-s(g(k),r) + o(l), 

and therefore s(n, r) < s(g(k),r) + 1 + o(l). Define 

12 In 2 4 



r{k) 



s(g(k),T) 



From the bound on s(n,r) above and the fact that K T (W g ^) < K T (W n ), we have 
^K T (W n ) < s (n,r)-^^-^-3| C {K T {W g{k) ) < r(k) - 1 + o(l)} 



C 



{^r(W ff(fc) ) < r(fc) - 1} 



where the last relation makes use of the fact that K T and r(fc) are integers. Thus we find 
that 

A m C |J {K r (Ty g(fc) ) < r(A:)-l}. 

fc>L/(m)J 

Consider the events above. For fixed k, 

Var T r iu\(T, q(k)) 

P(K T (W g(k) )<r(k)-l) = P(T r{k) (r,g(k)) = 0) < r( f " (31) 

1 r(fc)l r 5 9\ K ))) 

where we have used the fact that for a non-negative integer-valued random variable X 

VarX 
(EX) 

by Chebyshev's inequality. As r(k) < f(g(k)), Lemma |4] ensures that the final term in (31) 



P(X = 0) < P(\X-EX\>EX) < iu 



is less than k 2 , and the Borel-Cantelli lemma then implies that P(A m ) — > as m — > oo. 
This completes the proof of Theorem [TJ 
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